Single domain proteins are thought to be tightly packed, and the introduction of voids by mutations is often regarded as destabilizing. In this study we show that the packing density of single domain proteins decreases with chain length. We find that the radius of gyration provides a poor description of protein packing, but the alpha contact number we introduce here characterizes proteins well. We further demonstrate that a protein-like scaling relationship between packing density and chain length is observed in off-lattice self-avoiding walks. A key problem in studying compact chain polymers is the attrition problem: it is difficult to generate independent samples of compact long self-avoiding walks. We develop an algorithm based on the framework of sequential Monte Carlo and succeed in generating populations of compact long chain off-lattice polymers up to length N = 2,000. Results based on analysis of these chain polymers suggest that maintaining high packing density is characteristic only of short chain proteins. We find that the scaling behavior of packing density with chain length in proteins is a generic feature of random polymers satisfying a loose constraint on compactness. We conclude that proteins are not optimized by evolution to eliminate packing voids.
I. INTRODUCTION



Geometric considerations have led to important insights about protein structures. Voids are simple geometric features that represent packing defects inside protein structures. For multisubunit proteins such as GroEL and the potassium channel, voids or tunnels of large size are formed by the spatial arrangement of multiple subunits and are essential for the biological functions of these proteins. In this study, we focus on voids formed by packing defects that are not directly involved in protein function. For this purpose, we study only structures of single domain proteins. Although these proteins are well known to be compact, and their interior is frequently thought to be solid-like [9, 10], recent calculations showed that there are also numerous voids buried in the protein interior. The importance of tight packing in single chain proteins is widely appreciated: packing is thought to be important for protein stability, for kinetic nucleation of protein folding, and for successful design of novel proteins following a predefined backbone. The conservation of amino acid residues during evolution may also be correlated with tightly packed sites. In contrast, the potential roles of voids in affecting protein stability and in influencing tolerance to mutations and designability of proteins are not well understood.

An important parameter describing packing is the packing density $p_d$, which is a quantitative measure of the voids and was first introduced to study proteins by structural biologists. This concept has been widely used in protein chemistry [8, 13]. The scaling relationship between $p_d$ and chain length $N$ was first studied in an earlier work. $p_d$ can be thought of as the physical volume $v_{vdw}$ occupied by the union of van der Waals atoms, divided by the volume $v_{env}$ of an envelope that tightly wraps around the body of atoms: $p_d = v_{vdw}/v_{env}$. Voids contained within the molecule are not part of the van der Waals volume $v_{vdw}$, but are included in $v_{env}$. Using geometric algorithms, $v_{vdw}$, $v_{env}$, and $p_d$ can be readily computed for protein structures in the Protein Data Bank.

In this work, we further study the scaling behavior of packing density $p_d$ with chain length in single domain proteins and explore the determinants of the observed scaling behavior. We seek to answer the following questions: Is the scaling behavior of $p_d$ unique to proteins? Are proteins optimized during evolution to eliminate packing voids? We introduce two new packing parameters, $n_\alpha$ (the alpha contact number) and $z_\alpha$ (the alpha coordination number). We show that $n_\alpha$ characterizes protein packing very well, with a linear scaling relationship with chain length, and that a widely used parameter, the radius of gyration $R_g$, characterizes protein packing poorly. To overcome the attrition problem, that is, the low success rate in generating compact long chain polymers, we develop an algorithm based on sequential Monte Carlo importance sampling and succeed in obtaining thousands of very compact long chain off-lattice polymers up to $N$ = 2,000. We demonstrate that the scaling behavior of $p_d$ for proteins can be qualitatively reproduced by randomly generated polymers with rudimentary constraints on $n_\alpha$. Our simulation studies lead us to conclude that proteins are not optimized to eliminate voids during evolution. Rather, voids in proteins are a generic feature of random polymers with a "reasonable" compactness (as measured by $z_\alpha$).

The paper is organized as follows. In the Methods section, we first describe briefly how $p_d$ and $n_\alpha$ are computed from the dual simplicial complex of a protein structure, and introduce an off-lattice discrete model for generating random polymer conformations. We next describe the sequential Monte Carlo importance sampling and resampling techniques that allow us to generate adequate samples satisfying various criteria on $n_\alpha$. In the Results section, we begin with the characterization of void properties of proteins by both $p_d$ and $R_g$. We then show the linear scaling behavior of $n_\alpha$ found in proteins, and afterwards discuss how $p_d$ scales with chain length for random polymers generated by sequential Monte Carlo. We conclude with a summary and discussion of our results.



II. METHODS 

Protein Data. To avoid the complications of multichain and multidomain proteins, we examine the packing density of single domain proteins only. We collect proteins from the Pdbselect database that contain only one domain, defined as single chains in the Scop database with one numerical label.

Dual Simplicial Complex, Alpha Coordination Number, and Packing Density. We use the alpha shape to characterize the geometry of protein structure. Alpha shape has been successfully applied to a number of problems in proteins, including void measurement, binding site characterization, protein packing, electrostatic calculations, and protein hydration [11, 20, 21, 24, 25, 26, 27, 28]. Briefly, we first obtain a Delaunay simplicial complex of the molecule from the weighted Delaunay triangulation, which decomposes the convex hull of atom centers into tetrahedra (3-simplices), triangles (2-simplices), edges (1-simplices), and vertices (0-simplices). We then obtain the dual simplicial complex of the protein molecule by removing any tetrahedra, triangles, and edges whose corresponding Voronoi vertices, edges, and planar facets are not fully contained within the protein molecule [29, 30]. The edges between atoms that are not connected by bonds correspond to nonbonded alpha contacts. The sum over all atoms of the number of edges at each atom is the total number of alpha contacts $n_\alpha$. It reflects the total number of atoms that are in physical nearest neighbor contact with other atoms. These atoms have volume overlap and their corresponding weighted Voronoi cells intersect. The alpha coordination number is $z_\alpha = n_\alpha/n$, where $n$ is the total number of atoms in the molecule (see Figure 1). In our calculation, we consider only nonbonded alpha contacts. Details of the theory and computation of alpha shape and dual simplicial complex can be found elsewhere.

FIG. 1: The alpha contacts in a toy molecule. In this molecule, both atom 1 and atom 2 have 4 alpha contacts. The number of atoms is $n = 9$, the number of alpha contacts is $n_\alpha = 22$ (twice the number of edges), and the alpha coordination number is $z_\alpha = n_\alpha/n \approx 2.4$.
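As a minimal illustration of the bookkeeping behind $n_\alpha$ and $z_\alpha$, the sketch below counts contacts from a list of nonbonded edges. The edge list `TOY_EDGES` is hypothetical: it is not the molecule of Figure 1, whose full edge set is not given in the text, and it only reproduces the stated totals (9 atoms, 11 edges, atoms 1 and 2 with 4 contacts each). Computing the real edge set requires the weighted Delaunay triangulation and alpha shape, which is not attempted here.

```python
from collections import Counter

def alpha_contact_stats(n_atoms, edges):
    """Return (per-atom contact counts, n_alpha, z_alpha) for an edge list.

    Each edge (i, j) contributes one contact to atom i and one to atom j,
    so n_alpha is twice the number of edges.
    """
    per_atom = Counter()
    for i, j in edges:
        per_atom[i] += 1
        per_atom[j] += 1
    n_alpha = sum(per_atom.values())
    z_alpha = n_alpha / n_atoms
    return per_atom, n_alpha, z_alpha

# Hypothetical edge list: 9 atoms joined by 11 nonbonded alpha contacts.
TOY_EDGES = [(1, 2), (1, 3), (1, 4), (1, 5), (2, 6), (2, 7), (2, 8),
             (3, 4), (5, 9), (6, 7), (8, 9)]

per_atom, n_alpha, z_alpha = alpha_contact_stats(9, TOY_EDGES)
print(per_atom[1], per_atom[2], n_alpha, round(z_alpha, 2))
```

With 11 edges the sketch gives $n_\alpha = 22$ and $z_\alpha = 22/9$, matching the values quoted in the caption.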

Following previous work, we define packing density $p_d$ as:

$$p_d = \frac{v_{vdw}}{v_{env}} = \frac{v_{vdw}}{v_{ms} + v_{voids}}$$

where $v_{vdw}$, $v_{ms}$, and $v_{voids}$ are the van der Waals volume, the molecular surface volume, and the void volume of the molecule, respectively. Packing density is computed with a solvent probe radius of 1.4 Å, as described previously.
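The definition can be written as a one-line function once the three volumes are available. The sketch below assumes the volumes have already been computed by some geometric code (not shown); the function name and the example numbers are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def packing_density(v_vdw, v_ms, v_voids):
    """p_d = v_vdw / (v_ms + v_voids); all volumes in the same units."""
    v_env = v_ms + v_voids  # envelope volume includes buried voids
    if v_env <= 0:
        raise ValueError("envelope volume must be positive")
    return v_vdw / v_env

# Hypothetical volumes in cubic angstroms, for illustration only.
example_pd = packing_density(7400.0, 9800.0, 200.0)
print(example_pd)
```

Increasing the void volume at fixed van der Waals and molecular surface volumes lowers $p_d$, which is the sense in which $p_d$ measures voids.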

Growth Model for Off-Lattice Random Polymers. We use a modified version of a previously developed off-lattice discrete m-state model to generate self-avoiding walks (SAWs) in three-dimensional space. All monomers are treated as balls with a radius of 1.7 Å. For monomers $i$ and $j$ that are not sequence near neighbors ($|i - j| > 2$), the Euclidean distance $d(i,j)$ between them must be greater than $2 \times 1.7$ Å so that the chain is self-avoiding. Sequence neighboring monomers are connected by a bond of length 1.5 Å.
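The self-avoidance rule above can be checked directly on a list of coordinates. This is a brute-force sketch, assuming that pairs with $|i - j| \le 2$ are exempt from the distance test as the text indicates; the function name is hypothetical and no spatial indexing is used, so it is quadratic in chain length.

```python
import math

RADIUS = 1.7        # monomer radius in angstroms (from the text)
MIN_SEPARATION = 2  # pairs with |i - j| > 2 are tested (reading of the text)

def is_self_avoiding(coords):
    """True if every tested pair is farther apart than 2 * RADIUS."""
    n = len(coords)
    for i in range(n):
        for j in range(i + MIN_SEPARATION + 1, n):
            if math.dist(coords[i], coords[j]) <= 2 * RADIUS:
                return False
    return True

# A fully extended chain with 1.5 angstrom bonds passes the test.
straight = [(1.5 * i, 0.0, 0.0) for i in range(6)]
print(is_self_avoiding(straight))
```

Note that with 1.5 Å bonds and 1.7 Å radii, monomers two positions apart are always closer than 3.4 Å, which is why near neighbors in sequence have to be excluded from the test.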

We use a chain growth model to obtain conformations of a polymer of specified length. There are $m = 32$ possible states where the next monomer can be placed. They are evenly distributed spatially on a sphere of radius 1.5 Å centered at the current monomer. We forbid the placement of the new monomer anywhere on a cap of the sphere with an angle < 60° from the entering bond. This ensures that there are no unnaturally sharp acute bond angles. The remaining sphere is divided into 4 strips, which may differ in width but have equal surface area. For the 32 possible states, we place 8 points uniformly on the midline of each of the 4 strips. Following Park & Levitt, the coordinates of each state are parameterized by two angles $\alpha$ and $\tau$ for ease of computation: $\alpha$ is the bond angle formed by the $(i-1)$-th, $i$-th, and $(i+1)$-th monomers, and $\tau$ is the torsion angle formed by four consecutive monomers.

FIG. 2: The 32-state discrete model for chain growth. There are 32 possible positions for adding the next monomer. They are located on a sphere of radius 1.5 Å, but placement on the cap with an angle < 60° from the entering bond is forbidden. The remaining surface is divided into four strips of equal area. Eight positions are placed evenly on each strip. $\alpha$ is the bond angle.
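One way to realize the 32 states is sketched below. It relies on the fact that equal-area strips on a sphere correspond to equal intervals in $\cos\alpha$. Two points are assumptions rather than statements from the text: the "midline" of each strip is taken as the midpoint in $\cos\alpha$, and the eight points per strip are spaced evenly in an azimuthal angle that stands in for the torsion angle $\tau$ within a local frame. The function names are hypothetical.

```python
import math

def candidate_states(cap_deg=60.0, n_strips=4, per_strip=8):
    """Return (alpha, tau) pairs in radians for the discrete growth states."""
    cos_hi = math.cos(math.radians(cap_deg))  # edge of the forbidden cap
    cos_lo = -1.0                             # pole opposite the entering bond
    width = (cos_hi - cos_lo) / n_strips      # equal width in cos(alpha) = equal area
    states = []
    for s in range(n_strips):
        cos_mid = cos_hi - (s + 0.5) * width  # assumed midline of strip s
        alpha = math.acos(cos_mid)
        for k in range(per_strip):
            tau = 2.0 * math.pi * k / per_strip
            states.append((alpha, tau))
    return states

def to_local_xyz(alpha, tau, bond_length=1.5):
    """Offset of the new monomer in a local frame whose z axis points
    from the current monomer back along the entering bond."""
    s = math.sin(alpha)
    return (bond_length * s * math.cos(tau),
            bond_length * s * math.sin(tau),
            bond_length * math.cos(alpha))

states = candidate_states()
print(len(states), sorted({round(math.degrees(a), 1) for a, _ in states}))
```

Under these assumptions every candidate lies 1.5 Å from the current monomer and all four bond angles exceed 60°. Converting the local offsets to absolute coordinates requires a frame built from the preceding monomers, which is omitted here.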

Approximately Maximum Compact Polymer. In addition, we generate polymers that are approximately maximally compact, based on the face centered cubic (FCC) packing of balls of radius 1.7 Å. For hard spheres, FCC packing has recently been proved to be the tightest packing. Because the distance between two contacting balls in canonical FCC packing is $2 \times 1.7 = 3.4$ Å, which is greater than the bond length of 1.5 Å, we shorten the distance along bonds connecting contacting balls of radius 1.7 Å to 1.5 Å. This mimics the bond length of the model polymer. Unlike FCC packing of hard spheres, bonded monomers here are allowed to have volume overlaps. Additionally, there are some boundary effects because bonds connecting balls in different layers have a length > 1.5 Å. Although this is mathematically unproven, we conjecture that this artificially constructed polymer represents conformations of SAWs that are very close to maximum compactness.

FIG. 3: The steps of the sequential Monte Carlo method applied to improve sampling efficiency.

The packing density of canonical FCC packing by our method is 0.74. As described earlier, although FCC packing contains no voids, there are packing crevices or dead spaces that do contribute to the calculation of $p_d$ by our definition. In the approximately maximum compact polymer, because the distance between bonded balls is shorter than that in FCC packing, $p_d$ can be as high as 0.80 for polymers over a range of chain lengths.
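The value 0.74 agrees numerically with the classical close-packing fraction of equal hard spheres. As a check, consider the FCC unit cell, which contains four spheres of radius $r$ in a cube of edge $2\sqrt{2}\,r$ (the symbols here are introduced only for this worked example):

$$\frac{4 \cdot \frac{4}{3}\pi r^3}{\left(2\sqrt{2}\,r\right)^3} = \frac{16\pi/3}{16\sqrt{2}} = \frac{\pi}{3\sqrt{2}} \approx 0.7405$$

The remaining fraction of about 0.26 is the crevice or dead space between touching spheres, which is why a void-free FCC packing still has $p_d < 1$.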

Importance Sampling with Sequential Monte Carlo. Since we are simulating compact conformations that resemble proteins, we need an efficient method to generate an adequate number of conformations satisfying protein-like compactness criteria. Here we use a sequential Monte Carlo (SMC) chain growth strategy, which combines importance sampling and the growth method. The main steps are shown in Figure 3.

Denote the conformation of a polymer of length $t$ as $(x_1, \ldots, x_t)$, where $x_i$ is the 3-dimensional location of the $i$-th monomer. Starting with a fixed initial location $(x_1, x_2)$, we grow polymers by sequentially adding one monomer $x_{t+1}$ to occupy one of the 32 states connected to the last monomer $x_t$ of the current chain. The monomer $x_{t+1}$ is randomly placed according to a sampling probability $g_{t+1}(x_{t+1} \mid x_1 \ldots x_t)$. In this study, the following function $g_{t+1}$ is used. Let $\omega$ be one of the 32 states connected to $x_t$ that satisfies the self-avoiding criterion. First we initialize the number of neighbors $n_e(\omega)$ of $\omega$ to 1, and the Euclidean distance $d(\omega)$ from $\omega$ to the nearest neighbor monomer to 6 Å. We then increment $n_e(\omega)$ by the number of existing monomers within a distance of 6.0 Å of $\omega$. Among these monomers, we identify the monomer $x_s$ that is the nearest neighbor, with the shortest Euclidean distance $d$ to $\omega$. We require in addition that the sequence separation satisfies $|s - t| > 3$, so that $x_s$ and $x_t$ are not sequence near neighbors. The distance $d(\omega)$ is then replaced by the value of $d$. The sampling probability is set as:

$$g_{t+1}(x_{t+1} = \omega \mid x_1 \ldots x_t) \propto e^{-E'(\omega)/T'}$$

where $E'(\omega) = \ln\left[d(\omega)/n_e(\omega)^c\right]$ is an artificial "packing energy" favoring more compact conformations, and $T'$ is a pseudo-temperature controlling the behavior of sampling. Using this energy function, growth to a position $\omega$ with a close nearest neighbor (small $d(\omega)$) and a large number of neighbors within a 6 Å distance (large $n_e(\omega)$) is favored. Here the adjustable parameter $c$ is used to balance the effects of $d(\omega)$ and $n_e(\omega)$, and $T'$ controls the importance of compactness. At low $T'$, the conformations generated are compact, but at high $T'$ the compactness criterion becomes less important.
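The neighbor counting and the resulting sampling probabilities can be sketched as follows. The sketch assumes the energy has the form $E'(\omega) = \ln[d(\omega)/n_e(\omega)^c]$, which is consistent with the description (small $d$ and large $n_e$ favored, with $c$ balancing the two) but should be checked against the original equation. Function names are hypothetical, indices are 0-based, and the candidates passed in are assumed to have already passed the self-avoidance test.

```python
import math

CUTOFF = 6.0  # neighbor cutoff in angstroms (from the text)

def packing_energy(omega, chain, c):
    """E'(omega) under the assumed form ln(d / n_e**c)."""
    t = len(chain) - 1       # index of the current last monomer
    n_e, d = 1, CUTOFF       # initial values described in the text
    for s, x in enumerate(chain):
        dist = math.dist(omega, x)
        if dist <= CUTOFF:
            n_e += 1
            # Only monomers far enough along the sequence may set d.
            if abs(s - t) > 3 and dist < d:
                d = dist
    return math.log(d) - c * math.log(n_e)

def sampling_probabilities(candidates, chain, temperature, c):
    """Normalize exp(-E'/T') over the candidate positions."""
    weights = [math.exp(-packing_energy(w, chain, c) / temperature)
               for w in candidates]
    total = sum(weights)
    return [w / total for w in weights]

# Toy input: an extended chain and two arbitrary candidate points.
chain = [(1.5 * i, 0.0, 0.0) for i in range(6)]
candidates = [(0.0, 4.0, 0.0), (7.5, 20.0, 0.0)]
print(sampling_probabilities(candidates, chain, temperature=1.0, c=0.0))
```

In this toy input the first candidate has a neighbor 4 Å away and the second has none within the cutoff, so the first is preferred; lowering `temperature` or raising `c` strengthens that preference.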

According to the sequential Monte Carlo framework, the importance weight $w_{t+1}$ for the sampled conformation $(x_1, \ldots, x_{t+1})$ is updated as:

$$w_{t+1} = w_t \cdot \frac{\pi_{t+1}(x_1 \ldots x_{t+1})}{\pi_t(x_1 \ldots x_t) \cdot g_{t+1}(x_{t+1} \mid x_1 \ldots x_t)}$$

where $\pi_{t+1}(x_1 \ldots x_{t+1})$ is the target distribution at $t + 1$. With a set of weighted samples $\{(x_1^{(j)}, \ldots, x_n^{(j)}), w_n^{(j)}\}_{j=1}^{m}$, statistical inference on the target distribution $\pi_n(x_1, \ldots, x_n)$ can be made using

$$E_{\pi_n}[h(x_1, \ldots, x_n)] = \frac{\sum_{j=1}^{m} h(x_1^{(j)}, \ldots, x_n^{(j)}) \, w_n^{(j)}}{\sum_{j=1}^{m} w_n^{(j)}} \qquad (1)$$

for most proper functions $h$.
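The weight update and the weighted average of Equation (1) translate into a few lines. This is a minimal sketch with hypothetical function names; `target_ratio` stands for $\pi_{t+1}/\pi_t$ evaluated on the current chain, which the caller is assumed to supply.

```python
def update_weight(w_t, target_ratio, g_next):
    """w_{t+1} = w_t * (pi_{t+1} / pi_t) / g_{t+1}."""
    if g_next <= 0:
        raise ValueError("sampling probability must be positive")
    return w_t * target_ratio / g_next

def weighted_mean(values, weights):
    """Equation (1): sum_j h_j w_j / sum_j w_j."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(h * w for h, w in zip(values, weights)) / total

# A step drawn with probability 0.25 multiplies the weight by 4
# when the target ratio is 1.
print(update_weight(2.0, 1.0, 0.25))
print(weighted_mean([1.0, 3.0], [1.0, 3.0]))
```

Steps that were sampled with low probability therefore carry a larger weight, which is how the bias of the compactness-favoring proposal is corrected in the final estimate.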

The Target Distribution. We wish to generate random samples of polymers with different compactness criteria. This is achieved by using a target distribution $\pi_n$ that is uniform among all SAWs satisfying a compactness constraint. The constraint is set as follows. First, for each chosen pair of values $(T', c)$, we use the function $e^{-E'(\omega)/T'}$ to generate 500 random conformations as a trial run. Ignoring the importance weights, we calculate the mean alpha coordination number $\bar{z}_\alpha(n, T', c)$ of all the generated conformations. Then we set the target distribution $\pi_n$ as the uniform distribution over all SAWs satisfying $z_\alpha \in (0.8 \cdot \bar{z}_\alpha(n, T', c),\ 1.2 \cdot \bar{z}_\alpha(n, T', c))$.

We then rerun a large simulation with the same $(T', c)$ parameters and harvest the conformations. The intermediate target distributions $\pi_t$ are uniform with no restriction, and the truncated distribution $\pi_n$ is taken as the final target distribution. The truncation is achieved by discarding all generated conformations that do not satisfy the constraint. Typically, the truncation rate is very small (< 0.1%). The bias in sampling is fully compensated by proper weighting.
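The two-stage procedure (a trial run to fix the window, then truncation of the production run) can be sketched as below. The function names and the sample values are hypothetical; each production sample is represented only by its $z_\alpha$ value and importance weight.

```python
def trial_mean(z_values):
    """Unweighted mean of z_alpha over the trial-run conformations."""
    return sum(z_values) / len(z_values)

def in_target(z_alpha, z_mean, lo=0.8, hi=1.2):
    """True if z_alpha lies in the open interval (lo * z_mean, hi * z_mean)."""
    return lo * z_mean < z_alpha < hi * z_mean

def truncate(samples, z_mean):
    """Discard (z_alpha, weight) pairs that fall outside the target window."""
    return [(z, w) for z, w in samples if in_target(z, z_mean)]

# Hypothetical trial run and production samples.
z_bar = trial_mean([3.0, 4.0, 5.0])
kept = truncate([(3.0, 1.0), (4.0, 2.0), (4.7, 0.5), (5.0, 1.0)], z_bar)
print(z_bar, kept)
```

The surviving pairs keep their original weights, so the weighted average of Equation (1) can be applied to them unchanged.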

Resampling. Because self-avoiding walks easily grow into a dead end, we use resampling to replace dead samples or samples with small weight and so improve sampling efficiency [37]. Intuitively, we check regularly during the chain growth process whether a particular chain is stuck in a dead end, is too extended, or has too little weight. If so, this chain is replaced by a replicate of another chain that has the desired compactness. Both duplicate chains then continue to grow, and the final two surviving chains are correlated up to the duplication event. Conformations of the monomers added after the duplication are uncorrelated. This resampling technique directs our simulation to the specified region of configuration space, where all conformations have the desired compactness, without introducing too much bias (see Figure 4).

FIG. 4: The steps of the resampling procedure for the sequential Monte Carlo method.

FIG. 5: The relationship between packing density $p_d$, radius of gyration $R_g$, and the number of residues $N$ in single domain proteins.

Although we found that the total contact number $n_\alpha$ is an excellent parameter for characterizing proteins, its calculation involves expensive computation of the weighted Delaunay triangulation and alpha shape. We therefore use $R_g$ as a surrogate parameter during resampling. For resampling, we use the previously described empirical relationship $R_g(n) = 2.2 \cdot n^{0.38}$, where $n$ is the number of monomers in the polymer. This relationship has been used as a constraint in NMR protein structure determination. We have the following pseudo code for resampling:

Procedure Resampling (m, d_s, R_t)
// m: Monte Carlo sample size, d_s: steps of looking-back.
// R_t: targeting R_g.
k <- number of dead conformations.
Divide m - k samples randomly into k groups.
for group i = 1 to k
    Find conformations not picked in previous d_s steps.
    // Pick the best conformation P_j
    P_j <- polymer with min |R_g - R_t|
    Replace one of k dead conformations with P_j
    Assign both copies of P_j half its original weight.
endfor

Here $d_s$ is used to maintain higher diversity among resampled conformations: a conformation that has been picked in the past $d_s$ steps is not available for resampling.
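The pseudo code can be turned into a short runnable sketch. Several details are assumptions, since the pseudo code leaves them open: each sample is a dictionary with the hypothetical keys `coords`, `weight`, `dead`, and `last_picked`; the target $R_t$ is taken from $R_g(n) = 2.2 \cdot n^{0.38}$ for the chain's own length; and a dead chain whose group has no eligible donor is simply left unreplaced.

```python
import math
import random

def radius_of_gyration(coords):
    """Root mean square distance of the monomers from their centroid."""
    n = len(coords)
    center = [sum(p[a] for p in coords) / n for a in range(3)]
    return math.sqrt(sum(math.dist(p, center) ** 2 for p in coords) / n)

def target_rg(n):
    """Empirical target R_g(n) = 2.2 * n**0.38 quoted in the text."""
    return 2.2 * n ** 0.38

def rg_gap(sample):
    """|R_g - R_t| for one sample."""
    coords = sample["coords"]
    return abs(radius_of_gyration(coords) - target_rg(len(coords)))

def resample(samples, step, d_s, rng=random):
    """Replace dead chains with copies of well-targeted live chains."""
    dead = [s for s in samples if s["dead"]]
    live = [s for s in samples if not s["dead"]]
    k = len(dead)
    if k == 0 or not live:
        return samples
    rng.shuffle(live)
    groups = [live[i::k] for i in range(k)]  # random split into k groups
    for slot, group in zip(dead, groups):
        eligible = [s for s in group if step - s["last_picked"] > d_s]
        if not eligible:
            continue  # assumed fallback: leave this dead chain in place
        best = min(eligible, key=rg_gap)
        best["weight"] /= 2.0        # both copies get half the weight
        best["last_picked"] = step
        slot.update(coords=list(best["coords"]), weight=best["weight"],
                    dead=False, last_picked=step)
    return samples
```

Halving the donor's weight and giving the same weight to its copy keeps the total weight of the live population unchanged, which is what lets the resampled set remain properly weighted.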

After resampling, the samples with their adjusted weights remain properly weighted with respect to the original target distribution. We can then calculate the expected alpha coordination number $z_\alpha$, the expected packing density $p_d$, or the expected value of any other function $h$ using Equation (1). With these sampling and resampling strategies, we can successfully grow thousands of self-avoiding walks of chain length up to 2,000 using a Linux cluster of 40 CPUs.



III. RESULTS 

Packing Density. Figure 5 shows the correlation of packing density $p_d$ with the number of residues $N$ in real proteins. A similar relationship has been observed previously. Here we further restrict the samples to single domain proteins by SCOP annotation. We found that $p_d$ decreases with chain length. That is, short chain proteins have high packing density, but $p_d$ decreases from > 0.85 to about 0.74-0.75 when the chain length reaches about 190 residues. Beyond this length, proteins seem to be indifferent to the existence of voids. This suggests that maintaining high packing density is characteristic only of short chain proteins.

FIG. 6: The scaling behavior of the total number of nonbonded atomic alpha contacts with the total number of atoms for single domain proteins. Here only contacts from different residues are counted.

Radius of Gyration of Proteins. To identify the factors that dictate the scaling behavior of $p_d$ with residue number $N$, we need to determine whether such scaling is due to physical constraints of statistical mechanics or is the product of extensive optimization by evolution. We study this problem by examining the scaling behavior of $p_d$ with $N$ in computer-generated random chain polymers.

Because of the enormity of conformational space, we focus on random polymers that resemble proteins in some rudimentary sense. One possible criterion is the radius of gyration $R_g$. This parameter has been widely used as a macroscopic description of protein packing. For single domain proteins, however, we found that there is substantial variance in $R_g$ among proteins of the same chain length (Figure 5). Therefore, $R_g$ characterizes protein packing rather poorly, and is unsuitable as a criterion for generating protein-like polymers for our purpose.

FIG. 7: Examples of self-avoiding walks of length 1,000 generated with the sampling probability function $e^{-E'(\omega)/T'}$ using different $(T', c)$ values. (a) Conformations generated with $(T', c) = (1.0, 0.0)$; (b) $(T', c) = (0.67, 0.0)$; and (c) $(T', c) = (0.1, 0.6)$. (d) Approximately maximally compact conformation.

Alpha Contacts. An alternative global description of protein structure is the total number of nonbonded alpha contacts $n_\alpha$ defined by the dual simplicial complex of the protein. In Figure 6 we plot $n_\alpha$ against the total number of atoms $n$ in the molecule. As discussed before, these contacts are identified by computing the dual simplicial complex of the molecule [11, 20]. The total number of contacts $n_\alpha$ scales linearly with $n$. It also scales linearly with the protein chain length (residue number $N$; data not shown). Regression leads to the linear relationship $n_\alpha = 4.28 \cdot n - 432$, with $R^2 = 0.995$. The alpha contact number $n_\alpha$ therefore provides a more accurate global characteristic of proteins than the radius of gyration $R_g$. This linear scaling of a packing-related property is similar to other linear scaling relationships observed for proteins, for example those of empirical solvation energy, protein surface area, and protein volume with chain length. It is interesting to note that the x-axis intercept of the linear regression model suggests that the size of a minimum protein would be on the order of 100 atoms, or about 9-10 residues.
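The intercept follows from setting $n_\alpha = 0$ in the regression line; the arithmetic is shown only as a check on the rounding:

$$4.28\,n - 432 = 0 \;\Longrightarrow\; n = \frac{432}{4.28} \approx 101$$

that is, roughly 100 atoms, as stated.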

The details of the linear scaling relationship are further examined in a replot of Figure 6 after normalization by $n$. It shows that for proteins with 1,000 atoms or more (> 120 residues), the alpha coordination number $z_\alpha = n_\alpha/n$ is a constant of about 4.2. For smaller proteins ($n$ < 1,000), $z_\alpha$ ranges from 2.5 to 4.0. A nonlinear curve fit leads to the relationship $z_\alpha = a - b/n$, where $a = 4.27 \pm 0.03$ and $b = 4.2 \times 10^2 \pm 26$. We use $z_\alpha$ as the criterion to select computationally generated random polymers for packing analysis.
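Because $z_\alpha = a - b/n$ is linear in $1/n$, the two parameters can also be estimated by ordinary least squares on $u = 1/n$. The sketch below uses that substitution, which is a simplification and not necessarily the fitting procedure used for the values above. The function name is hypothetical and the data are synthetic points generated from $a = 4.27$ and $b = 420$, not protein measurements.

```python
def fit_a_minus_b_over_n(ns, zs):
    """Least-squares fit of z = a - b / n, treated as linear in u = 1 / n."""
    us = [1.0 / n for n in ns]
    m = len(us)
    mean_u = sum(us) / m
    mean_z = sum(zs) / m
    s_uu = sum((u - mean_u) ** 2 for u in us)
    s_uz = sum((u - mean_u) * (z - mean_z) for u, z in zip(us, zs))
    slope = s_uz / s_uu  # slope of z against u equals -b
    a = mean_z - slope * mean_u
    return a, -slope

# Synthetic, noise-free points on the curve z = 4.27 - 420 / n.
ns = [200, 500, 1000, 2000, 5000]
zs = [4.27 - 420.0 / n for n in ns]
a_fit, b_fit = fit_a_minus_b_over_n(ns, zs)
print(round(a_fit, 3), round(b_fit, 1))
```

Dividing the linear regression $n_\alpha = 4.28 \cdot n - 432$ by $n$ gives the same functional form, $z_\alpha = 4.28 - 432/n$, with coefficients close to the fitted $a$ and $b$.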

Targeted Sampling of Random Chain Polymer. Figure 7 shows typical conformations generated with different $(T', c)$ parameters and the conformation of the maximally compact polymer. Figure 8 shows the histograms of $z_\alpha$ for conformations of length 2,000, without weight adjustment, generated using different $(T', c)$ parameters. The histograms for different values of $(T', c)$ do not overlap. This demonstrates that with properly chosen $(T', c)$, we can efficiently generate random polymers with $z_\alpha$ within a targeted range.

Packing Density of Random Chain Polymer. Figure 9 shows $n_\alpha$ for each pair of $(T', c)$ values as a function of chain length $n$, together with the $n_\alpha$ values for the maximally compact conformations. It also shows the relationship between $z_\alpha$ and $n$. Note that the targeted $z_\alpha$ values generated with different $(T', c)$ parameters give rise to different $z_\alpha \sim n$ scaling behavior. Because the coarse-grained random polymers generated here lack side chains, they are fundamentally different from real proteins. We have therefore experimented with several $(T', c)$ values. We find that protein-like scaling can be obtained for a wide range of $(T', c)$ values.

Figure 9 shows the average packing density $p_d$ for all conformations satisfying the constraint specified by different $(T', c)$ values. Except for the maximally compact conformations, the scaling of $p_d \sim n$ for all other sets of polymers is remarkably similar to that of proteins (Figure 5).

FIG. 8: Self-avoiding walks generated with different $(T', c)$ values do not overlap in $z_\alpha$ values. (a) Conformations generated with $(T', c) = (1.0, 0.0)$; (b) $(T', c) = (0.67, 0.0)$; and (c) $(T', c) = (0.1, 0.6)$. Here the numbers of conformations are unweighted. The weighted average $z_\alpha$ values are as shown in Figure 9 at length $n$ = 2,000.

Conformations from Set 1 ($(T', c) = (1.0, 0.0)$) are more extended and have lower average $z_\alpha$ (e.g., Figure 7a). Because there are fewer voids, they also have high $p_d$. Conformations in Set 3 ($(T', c) = (0.1, 0.6)$) are more compact, make more nonbonded contacts, and hence have high $z_\alpha$ values (e.g., Figure 7c). They also form more voids, and therefore have lower $p_d$ values. Set 2 ($(T', c) = (0.67, 0.0)$) contains conformations whose properties are between those of Set 1 and Set 3 (e.g., Figure 7b).



TABLE I: The relationship between $z_\alpha$ and $n$ can be described by the equation $z_\alpha = a - b/n$. The estimated values of $a$ and $b$, along with standard deviations in parentheses, are listed for three different sets of conformations generated with different $(T', c)$ parameters, which characterize the sampling probability. The values of $a$ and $b$ for real proteins are also listed.

FIG. 9: The relationship of alpha contact number $n_\alpha$, alpha coordination number $z_\alpha$, and the packing density $p_d$ of random compact and maximally compact self-avoiding walks. The curves are sampled following the function $e^{-E'(\omega)/T'}$, with different $T'$ and $c$ values. Each data point is an average of 10 runs of sample size 600.



The relationship between $z_\alpha$ and chain length $n$ can be characterized by the nonlinear equation $z_\alpha = a - b/n$, similar to that of real proteins. The values of $a$ and $b$ obtained by curve fitting are listed in Table I. We emphasize that these randomly generated self-avoiding walks are fundamentally very different from proteins: all residues are of uniform size, there are no side chains, and there are no hydrophobic or other physical interactions in these polymers. Because it is impossible to quantitatively define a similarity metric that measures how different these polymers are from proteins, we are not able to decide which specific values of $(T', c)$ are optimal for modeling protein packing. Nevertheless, the scaling of $p_d \sim n$ for all $(T', c)$ values is qualitatively quite similar to that of real proteins.

The relationship between $p_d$ and $z_\alpha$ at chain length 1,800 for self-avoiding walks generated with different parameters $(T', c)$ is shown in Figure 10. Both extended and maximally compact conformations have high $p_d$ values, but conformations with an intermediate value of $z_\alpha$ contain voids and have smaller $p_d$ values. This is similar to the relationship between $p_d$ and a compactness parameter $p$ (equivalent to the $z_\alpha$ used here) studied previously on the two-dimensional lattice (see Figure 8 of that work).

FIG. 10: The relationship of $p_d$ and $z_\alpha = n_\alpha/n$ for self-avoiding walks generated using different $(T', c)$ parameters at chain length 1,800.



IV. SUMMARY AND DISCUSSION 

It is well acknowledged that proteins have high packing density $p_d$, as high as that of crystalline solids [8]. However, a recent study suggested that there are numerous voids and pockets in proteins. It was also found that about 1/3 of the residues in a protein deviate from FCC close packing and have random positions. The simulation results presented here indicate that chain connectivity, excluded volume, and global compactness are the main determinants of the scaling behavior of voids with chain length in proteins. Unlike maximally compact polymers, which maintain high packing density at all chain lengths, proteins and simple near-compact polymers have large $p_d$ values only for relatively short chains. When the chain length reaches 190 residues for proteins and about 600-700 for chain polymers, proteins and polymers have lower packing density and are quite tolerant of the formation of voids.

Global compactness is a necessary condition for the observed protein-like scaling behavior. However, not all parameters related to voids and compactness are equally appropriate. The data presented here suggest that the alpha coordination number $z_\alpha$ reflects basic intrinsic compactness properties of proteins that are absent in the widely used radius of gyration $R_g$. The advantage of parameters such as $z_\alpha$ underscores the importance of an accurate description of protein geometry and structure.

The parameter $p_d$ is biased towards short chain proteins. By definition, a polymer formed by 2, 3, or a small number of monomers does not have a long enough chain to form voids, and therefore has $p_d = 1.0$. When the chain becomes longer, voids appear. Similarly, $z_\alpha$ is also biased towards short chains. For very short chain polymers where the chain has few turns, few nonbonded contacts exist and no voids are formed. In this case, $z_\alpha$ is low and $p_d$ is high. However, this small size effect disappears rapidly in our model conformations of maximally compact polymers. The small size effect therefore does not fully account for the scaling behavior of $p_d$ and $z_\alpha$ in proteins and in simulated random polymers.

There are major differences between the self-avoiding walks we generated and real protein structures. Our SAWs have no side chains and belong to a coarse-grained model in which each monomer is represented as a ball. In addition, the target distribution of SMC sampling is the truncated uniform distribution over all geometrically feasible conformations. The truncation requires that polymers satisfy a prescribed $z_\alpha \sim n$ relationship. No physical forces such as hydrophobic interactions are used in the target distribution.

In this paper, we describe a novel approach to overcoming the attrition problem in generating long chain compact self-avoiding walks. With sequential importance sampling and resampling, we have developed an algorithm that effectively samples rare events, i.e., compact self-avoiding walks. Success in generating thousands of off-lattice self-avoiding walks satisfying various desired compactness requirements is essential for studying the scaling behavior of $p_d$ in random off-lattice self-avoiding walks.

The main result of this paper is that with sequential Monte Carlo techniques it is not difficult to reproduce the protein-like scaling behavior of packing density $p_d$ with chain length $n$ in generic chain polymers. With the guidance of a rudimentary requirement on $z_\alpha$, this can be achieved under a wide range of $z_\alpha \sim n$ relationships. We therefore conclude that proteins retain the packing properties of generic compact chain polymers. We further conclude that proteins are unlikely to be optimized by evolution to eliminate packing voids. This supports the insightful comments of Richards, who suggested that an appropriate level of under-packing would be important for evolution to occur through random mutations.

Our study showed the importance of generic geometric packing related to $z_\alpha$ in reproducing protein-like $p_d \sim n$ scaling behavior. To further test the role of geometric packing, the next step would be to examine the $p_d \sim n$ scaling of more realistic compact random polymers, perhaps with monomers of different sizes to model side chain effects. Furthermore, with sequential Monte Carlo and other advanced sampling methods, various models of explicit side chains can be attached to main chain monomers. In addition, one could introduce various alphabet sets for the residues (such as the HP model) and a corresponding potential energy function $H$. In this case, the target distribution can be the Boltzmann distribution $\pi \propto \exp(-H/T)$ instead of the uniform distribution over all SAWs. It would also be interesting to examine the $p_d \sim n$ scaling of polymers of random sequences and of protein-like sequences with low energy in compact states.